Comparative Study on Sentence Boundary Prediction for German and English Broadcast News

Authors

Yang Wang

Alexandre Nanchen

Alexandros Lazaridis

David Imseng

Philip N. Garner

Abstract

We present a comparative study of sentence boundary prediction for German and English broadcast news that explores generalization across languages. In the feature extraction stage, word pause duration is first extracted from word-aligned speech, and forward and backward language models are used to extract textual features. A gradient boosted machine, optimized by grid search, then maps these features to punctuation marks. Experimental results confirm that word pause duration is a simple yet effective feature for predicting whether a sentence boundary follows a given word. We found that the Bayes risk derived from the pause duration distributions of sentence-boundary words and non-boundary words is an effective measure of the inherent difficulty of sentence boundary prediction. The proposed method achieved F-measures of over 90% on reference text and around 90% on ASR transcripts for both a German broadcast news corpus and an English multi-genre broadcast news corpus, demonstrating state-of-the-art performance.
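The Bayes risk mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors estimate it, so everything here is an assumption: the function name `bayes_risk`, the histogram-based density estimate, the `bin_width` parameter, and the use of class frequencies as priors are all hypothetical choices made for illustration. For two classes, the Bayes risk is the error rate of the best possible classifier that sees only the pause duration, which with histograms reduces to summing the smaller of the two class counts in each bin.

```python
from collections import Counter


def bayes_risk(boundary_pauses, nonboundary_pauses, bin_width=0.05):
    """Histogram estimate of the two-class Bayes risk from pause durations.

    Illustrative assumption, not the paper's stated procedure: durations
    (in seconds) are binned with a fixed width, and class priors are taken
    from the class frequencies in the data.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    total = len(boundary_pauses) + len(nonboundary_pauses)
    if total == 0:
        raise ValueError("at least one pause duration is required")

    # Count how many words of each class fall into each duration bin.
    b_hist = Counter(int(p // bin_width) for p in boundary_pauses)
    n_hist = Counter(int(p // bin_width) for p in nonboundary_pauses)

    # In each bin the optimal decision picks the majority class, so the
    # minority count is the number of unavoidable errors in that bin.
    errors = sum(min(b_hist[k], n_hist[k]) for k in set(b_hist) | set(n_hist))
    return errors / total


# Toy pause durations in seconds (made-up values, not data from the paper).
boundary = [0.42, 0.61, 0.38, 0.02]
nonboundary = [0.01, 0.03, 0.0, 0.02, 0.41]
risk = bayes_risk(boundary, nonboundary, bin_width=0.1)
```

A value near 0 means the two pause distributions barely overlap, so pause duration alone nearly separates the classes; a larger value means more overlap and a harder task. Because the histograms are built from the same data they are evaluated on, this estimate tends to be optimistic for small samples or narrow bins.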

Similar resources

The need to create a media block for the convergence of overseas news networks

As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...

Deriving document structure from prosodic cues

This study presents an approach for prosody-driven segmentation of speech data. The model is based solely on F0 contours and RMS envelopes. Phoneme or word information from a speech recognizer is unnecessary. Using data from German broadcast news, we show how this prosodic information can be exploited to retrieve structural information of the spoken text. The suitability of the CART-like algori...

Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction

The use of dynamic conditional random fields (DCRF) has been shown to outperform linear-chain conditional random fields (LCRF) for punctuation prediction on conversational speech texts [1]. In this paper, we combine lexical, prosodic, and modified n-gram score features into the DCRF framework for a joint sentence boundary and punctuation prediction task on TDT3 English broadcast news. We show t...

Varying input segmentation for story boundary detection in English, Arabic and Mandarin broadcast news

Story segmentation of news broadcasts has been shown to improve the accuracy of subsequent processes such as question answering and information retrieval. In previous work, a decision tree was trained on automatically extracted lexical and acoustic features to predict story boundaries, using hypothesized sentence boundaries to define potential story boundaries. In this paper, we emp...

Sentence Boundary Detection in Broadcast Speech Transcripts

This paper presents an approach to identifying sentence boundaries in broadcast speech transcripts. We describe finite state models that extract sentence boundary information statistically from text and audio sources. An n-gram language model is constructed from a collection of British English news broadcasts and scripts. An alternative model is estimated from pause duration information in spee...

Publication date: 2017
